(57) A preliminary word-selecting section selects
one or more words following words which have been ob- 
tained in a word string serving as a candidate for a result 
of voice recognition; and a matching section calculates 
acoustic or linguistic scores for the selected words, and 
forms a word string serving as a candidate for a result 
of voice recognition according to the scores. A control 
section generates word-connection relationships be- 
tween words in the word string serving as a candidate 
for a result of voice recognition, sends them to a word- 
connection-information storage section, and stores 
them in it. A re-evaluation section corrects the
word-connection relationships stored in the
word-connection-information storage section 16, and the
control section determines a word string serving as the
result of voice recognition according to the corrected
word-connection relationships.
[0001] The present invention relates to voice recognition
apparatus and voice recognition methods.
[0002] The following paragraphs provide an indication 
of a technical problem to which the present invention is 
directed and an indication at least in part of a solution 
provided by embodiments of the present invention. 
[0003] Fig. 1 shows an example structure of a con- 
ventional voice recognition apparatus. 
[0004] Voice uttered by the user is input to a microphone 1,
and the microphone 1 converts the input voice
to an audio signal, which is an electric signal. The audio
signal is sent to an analog-to-digital (AD) conversion
section 2. The AD conversion section 2 samples, quantizes,
and converts the audio signal, which is an analog
signal sent from the microphone 1, into audio data which
is a digital signal. The audio data is sent to a feature 
extracting section 3. 

[0005] The feature extracting section 3 applies acoustic
processing to the audio data sent from the AD conversion
section 2 in units of an appropriate number of
frames to extract a feature amount, such as a Mel fre- 
quency cepstrum coefficient (MFCC), and sends it to a 
matching section 4. The feature extracting section 3 can 
extract other feature amounts, such as spectra, linear 
prediction coefficients, cepstrum coefficients, and line 
spectrum pairs. 

[0006] The matching section 4 uses the feature 
amount sent from the feature extracting section 3 and 
refers to an acoustic-model data base 5, a dictionary da- 
ta base 6, and a grammar data base 7, if necessary, to 
apply voice recognition, for example, by a
continuous-distribution HMM method to the voice (input voice) input
to the microphone 1.

[0007] More specifically, the acoustic-model data 
base 5 stores acoustic models indicating acoustic fea- 
tures of each phoneme and each syllable in a linguistic 
aspect of the voice to which voice recognition is applied. 
Since voice recognition is applied according to the
continuous-distribution hidden-Markov-model (HMM) method,
an HMM is, for example, used as an acoustic model.
The dictionary data base 6 stores a word dictionary in 
which information (phoneme information) related to the 
pronunciation of each word (vocabulary) to be recog- 
nized is described. The grammar data base 7 stores a 
grammar rule (language model) which describes how 
each word entered in the word dictionary of the dictionary
data base 6 is chained (connected). For example, the 
grammar rule may be a context free grammar (CFG) or 
a rule based on statistical word chain probabilities (N- 
gram). 

[0008] The matching section 4 connects acoustic
models stored in the acoustic-model data base 5 by
referring to the word dictionary of the dictionary data base
6 to constitute word acoustic models (word models). The
matching section 4 further connects several word models
by referring to the grammar rule stored in the grammar
data base 7, and uses the connected word models
to recognize the voice input to the microphone 1 by the
continuous-distribution HMM method according to feature
amounts. In other words, the matching section 4
detects a series of word models having the highest
score (likelihood) in observing time-sequential feature
amounts output from the feature extracting section 3,
and outputs the word string corresponding to the series
of word models as the result of voice recognition.

[0009] In other words, the matching section 4 accumulates
the probability of occurrence of each feature
amount for word strings corresponding to connected
word models, uses an accumulated value as a score,
and outputs the word string having the highest score as
the result of voice recognition.

[0010] A score is generally obtained by the total evaluation
of an acoustic score (hereinafter called acoustics
score, if necessary) given by acoustic models stored in
the acoustic-model data base 5 and a linguistic score
(hereinafter called language score, if necessary) given
by the grammar rule stored in the grammar data base 7.
[0011] More specifically, the acoustics score is calculated,
for example, by the HMM method, for each word
from acoustic models constituting a word model according
to the probability (probability of occurrence) by which
a series of feature amounts output from the feature
extracting section 3 is observed. The language score is
obtained, for example, by a bigram, according to the
probability of chaining (linking) between an aimed-at word
and a word disposed immediately before the aimed-at
word. The result of voice recognition is determined
according to the score (hereinafter called final score,
if necessary) obtained from a total evaluation of the
acoustics score and the language score for each word.

[0012] Specifically, the final score S of a word string
formed of N words is, for example, calculated by the
following expression, where w_k indicates the k-th word in
the word string, A(w_k) indicates the acoustics score of
the word w_k, and L(w_k) indicates the language score of
the word.

$$S = \sum_{k=1}^{N} \bigl( A(w_k) + C_k \times L(w_k) \bigr) \qquad (1)$$

[0013] Σ indicates a summation obtained when k is
changed from 1 to N. C_k indicates a weight applied to
the language score L(w_k) of the word w_k.
[0014] The matching section 4 performs, for example,
matching processing for obtaining N which makes the
final score represented by the expression (1) highest
and a word string w_1, w_2, ..., and w_N, and outputs the
word string w_1, w_2, ..., and w_N as the result of voice
recognition.
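
The expression (1) and the matching processing of paragraph [0014] can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the function names, the toy scoring functions `toy_acoustic` and `toy_language`, the use of one weight for every C_k, and all score values are hypothetical and introduced only for the example.

```python
from itertools import product

def final_score(words, acoustic, language, weight):
    """Expression (1): sum over k of A(w_k) + C_k * L(w_k).

    Assumption: one weight is shared by every C_k, and the language
    score is a bigram-style function of the preceding word.
    """
    total = 0.0
    previous = None
    for k, word in enumerate(words):
        total += acoustic(k, word) + weight * language(previous, word)
        previous = word
    return total

def best_word_string(vocabulary, acoustic, language, weight, max_words):
    """Score every word string of 1..max_words words and keep the best."""
    best_score, best_words = float("-inf"), ()
    for n in range(1, max_words + 1):
        for words in product(vocabulary, repeat=n):
            score = final_score(words, acoustic, language, weight)
            if score > best_score:
                best_score, best_words = score, words
    return best_score, best_words

# Hypothetical toy data: the utterance is "New York ni ikitai desu".
UTTERANCE = ("New York", "ni", "ikitai", "desu")
VOCABULARY = ["New York", "ni", "ikitai", "desu"]

def toy_acoustic(k, word):
    # High score when the k-th word matches the utterance, low otherwise.
    return 2.0 if k < len(UTTERANCE) and UTTERANCE[k] == word else -5.0

def toy_language(previous, word):
    # Flat penalty per word; a real bigram would depend on `previous`.
    return -1.0
```

Calling `best_word_string(VOCABULARY, toy_acoustic, toy_language, 0.5, 5)` searches over both N and the word string. Exhaustive enumeration is used here only to make the search explicit; its cost grows rapidly with the vocabulary size and the number of words.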

[0015] With the above-described processing, when
the user utters "New York ni ikitai desu," the voice
recognition apparatus shown in Fig. 1 calculates an
acoustics score and a language score for each word, "New
York," "ni," "ikitai," or "desu." When the final score
obtained from a total evaluation is the highest, the word
string, "New York," "ni," "ikitai," and "desu," is output as
the result of voice recognition.

[0016] In the above case, when five words, "New
York," "ni," "ikitai," and "desu," are stored in the word
dictionary of the dictionary data base 6, there are 5^5
kinds of five-word arrangement which can be formed of
these five words. Therefore, it can be said in a simple
way that the matching section 4 evaluates 5^5 word
strings and determines the most appropriate word string
(word string having the highest final score) for the user's
utterance among them. If the number of words stored in
the word dictionary increases, the number of word
strings formed of the words is the number of words
multiplied by itself the-number-of-words times.
Consequently, a huge number of word strings should be
evaluated.

[0017] In addition, since the number of words included
in an utterance is generally unknown, not only word strings
formed of five words but also word strings formed of one
word, two words, and so on should be evaluated.
Therefore, the number of word strings to be evaluated
becomes even larger. It is very important to efficiently
determine the most likely word string among a huge
number of word strings as the result of voice recognition
in terms of the amount of calculation and a memory
capacity to be used.
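
The growth described in paragraphs [0016] and [0017] can be checked with a short calculation. The helper name `count_word_strings` is hypothetical; it simply counts every word string of one word up to a given maximum number of words.

```python
def count_word_strings(vocabulary_size, max_words):
    """Number of word strings of 1..max_words words over the vocabulary."""
    return sum(vocabulary_size ** n for n in range(1, max_words + 1))
```

For a five-word dictionary, strings of exactly five words number 5**5 = 3125, and `count_word_strings(5, 5)` also counts the shorter strings.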

[0018] To make an efficient use of the amount of cal- 
culation and the memory capacity to be used, some 
measures are taken such as an acoustic branch-cutting 
technique for stopping score calculation when an acous- 
tics score obtained during a process for obtaining an 
acoustics score becomes equal to or less than a prede- 
termined threshold, or a linguistic branch-cutting tech- 
nique for reducing the number of words for which score 
calculation is performed, according to language scores. 
[0019] According to these branch-cutting techniques,
since the words for which score calculation is performed are
reduced according to a predetermined determination
reference (such as an acoustics score obtained during
calculation, described above, and a language score given
to a word), the amount of calculation is reduced. If
many words are removed, namely, if a severe determination
reference is used, however, even a word which
is to be correctly obtained as a result of voice recognition
is also removed, and erroneous recognition occurs.
Therefore, in the branch-cutting techniques, word
reduction needs to be performed with a margin provided
to some extent so as not to remove a word which is to
be correctly obtained as a result of voice recognition.
Consequently, it is difficult to largely reduce the amount
of calculation.
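
The acoustic branch-cutting technique and the effect of the margin can be sketched as below. This is only an illustration: the function name, the per-frame scores in `FRAME_SCORES`, and the two threshold values are assumptions made for the example.

```python
def prune_by_partial_score(frame_scores_by_word, threshold):
    """Return {word: total score} for words that survive branch cutting.

    Score calculation for a word stops as soon as its running acoustics
    score becomes equal to or less than `threshold`.
    """
    survivors = {}
    for word, frame_scores in frame_scores_by_word.items():
        running = 0.0
        for score in frame_scores:
            running += score
            if running <= threshold:
                break  # branch cut: abandon this word
        else:
            survivors[word] = running  # all frames scored without a cut
    return survivors

# Hypothetical per-frame acoustics scores (log-likelihood style).
FRAME_SCORES = {
    "akita": [-1.0, -1.0, -1.0],
    "akebono": [-1.0, -6.0, -1.0],
    "ni": [-2.0, -2.0, -2.0],
}
```

With a severe threshold of -5.0 only "akita" survives, so a correct word that happens to score poorly in one frame would be lost; with a threshold of -10.0 (a wider margin) all three words survive and no calculation is saved.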

[0020] When acoustics scores are obtained independently
for all words for which score calculation is to
be performed, the amount of calculation is large.
Therefore, a method has been proposed for making a common
use of (sharing) a part of acoustics-score calculation
for a plurality of words. In this sharing method, a
common acoustic model is applied to words stored in
the word dictionary which have the same first phoneme,
from the first phoneme to the last phoneme they have in
common, and acoustic models are independently applied to
the subsequent phonemes to constitute one tree-structure
network as a whole and to obtain acoustics scores. More
specifically, for example, the words, "akita" and
"akebono," are considered. When it is assumed that the
phoneme information of "akita" is "akita" and that of
"akebono" is "akebono," the acoustics scores of the words,
"akita" and "akebono," are calculated in common for the
first to second phonemes "a" and "k." Acoustics scores
are independently calculated for the remaining phonemes
"i," "t," and "a" of the word "akita" and the remaining
phonemes "e," "b," "o," "n," and "o" of the word
"akebono."
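
The tree-structure network for "akita" and "akebono" can be sketched as a prefix tree over phonemes. The names `build_phoneme_tree`, `count_nodes`, `WORD_KEY`, and `LEXICON` are hypothetical, and each letter is treated as one phoneme, as assumed in the paragraph above.

```python
WORD_KEY = "#word"  # marker key; assumed never to collide with a phoneme

def build_phoneme_tree(lexicon):
    """Build a nested-dict tree in which shared phoneme prefixes share nodes."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.setdefault(phoneme, {})
        node[WORD_KEY] = word  # the word is known only at the end of its path
    return root

def count_nodes(tree):
    """Count phoneme nodes, i.e. acoustic-model evaluations needed per frame."""
    return sum(1 + count_nodes(child)
               for key, child in tree.items() if key != WORD_KEY)

LEXICON = {"akita": list("akita"), "akebono": list("akebono")}
```

For this lexicon the tree has 10 phoneme nodes against 12 when the two words are scored independently, because "a" and "k" are shared. The identity of the word is not available at the shared nodes, only after the branch at the third phoneme.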

[0021] Therefore, according to this method, the
amount of calculation performed for acoustics scores is
largely reduced.

[0022] In this method, however, when a common part
is calculated (acoustics scores are calculated in common),
the word for which acoustics scores are being calculated
cannot be determined. In other words, in the
above example of the words, "akita" and "akebono,"
when acoustics scores are being calculated for the first
and second phonemes "a" and "k," it cannot be determined
whether acoustics scores are calculated for the
word "akita" or the word "akebono."

[0023] In this case, as for "akita," when the calculation
of an acoustics score starts for its third phoneme, "i," it
can be determined that the word for which the calculation
is being performed is "akita." Also as for "akebono,"
when the calculation of an acoustics score starts for
its third phoneme, "e," it can be determined that the word
for which the calculation is being performed is "akebono."

[0024] Therefore, when a part of acoustics-score calculation
is shared, a word for which the calculation is
being performed cannot be identified when the
acoustics-score calculation starts. As a result, it is difficult to
use the above-described linguistic branch-cutting method
before the start of acoustics-score calculation.
Wasteful calculation may be performed.

[0025] In addition, when a part of acoustics-score calculation
is shared, the above-described tree-structure
network is formed for all words stored in the word dic- 
tionary. A large memory capacity is required to hold the 
network. 

[0026] To make an efficient use of the amount of calculation
and the memory capacity to be used, another
technique may be taken in which acoustics scores are
calculated not for all words stored in the word dictionary
but only for words preliminarily selected. The preliminary
selection is performed by using, for example, simple
acoustic models or a simple grammar rule which
does not have very high precision.
[0027] A method for preliminary selection is described,
for example, in "A Fast Approximate Acoustic
Match for Large Vocabulary Speech Recognition," IEEE
Trans. Speech and Audio Proc., vol. 1, pp. 59-67, 1993,
written by L. R. Bahl, S. V. De Gennaro, P. S.
Gopalakrishnan and R. L. Mercer.
[0028] The acoustics score of a word is calculated by
using a series of feature amounts of voice. When the
starting point or the ending point of the series of feature
amounts to be used for calculation is different, the
acoustics score to be obtained also changes. This change
affects the final score obtained by the expression (1), in
which an acoustics score and a language score are
totally evaluated.

[0029] The starting point and the ending point of the
series of feature amounts corresponding to a word,
namely, the boundaries (word boundaries) of words, can
be obtained, for example, by a dynamic programming
method. A point in the series of feature amounts is set
to a candidate for a word boundary, and a score
(hereinafter called a word score, if necessary) obtained by
totally evaluating an acoustics score and a language
score is accumulated for each word in a word string,
which serves as a candidate for a result of voice
recognition. The candidates for word boundaries which give
the highest accumulated values are stored together with
the accumulated values.

[0030] When the accumulated values of word scores 
have been obtained, word boundaries which give the 
highest accumulated values, namely, the highest 
scores, are also obtained.
[0031] The method for obtaining word boundaries in
the above way is called Viterbi decoding or one-pass
decoding, and its details are described, for example, in
"Voice Recognition Using Probability Model," the Journal
of the Institute of Electronics, Information and
Communication Engineers, pp. 20-26, July 1, 1988, written
by Seiichi Nakagawa.
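
The dynamic programming over word-boundary candidates described in paragraphs [0029] and [0030] can be sketched as follows. This is a simplified illustration, not the cited decoding method in full: frames are indexed by integers, `word_score(word, start, end)` is a hypothetical function returning the word score for a segment, and the toy scores below are invented for the example.

```python
def segment(num_frames, vocabulary, word_score):
    """Find the word string and boundaries with the highest accumulated score."""
    neg_inf = float("-inf")
    best = [neg_inf] * (num_frames + 1)   # best accumulated score ending at t
    back = [None] * (num_frames + 1)      # (start, word) that achieved it
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        for start in range(end):          # every earlier boundary candidate
            if best[start] == neg_inf:
                continue
            for word in vocabulary:
                candidate = best[start] + word_score(word, start, end)
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = (start, word)
    # Boundaries become final only here, after the last frame is scored.
    words, t = [], num_frames
    while t > 0:
        start, word = back[t]
        words.append((word, start, t))
        t = start
    words.reverse()
    return best[num_frames], words

# Hypothetical true segments for a 7-frame utterance "kyou wa ii".
TRUE_SEGMENTS = {"kyou": (0, 3), "wa": (3, 5), "ii": (5, 7)}

def toy_word_score(word, start, end):
    # Penalize the distance of each boundary from the true boundary.
    true_start, true_end = TRUE_SEGMENTS[word]
    return -float(abs(start - true_start) + abs(end - true_end))
```

The traceback at the end of `segment` shows why the word boundaries cannot be fixed before word scores have been calculated up to the end of the feature-amount series.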

[0032] To effectively perform the above-described 
preliminary selection, it is very important to determine 
word boundaries, that is, to determine a starting point in 
a series (feature-amount series) of a feature amount. 
[0033] Specifically, in a feature-amount series obtained
from a speech "kyouwaiitenkidesune" shown in
Fig. 2(A), for example, when a correct word boundary is
disposed at time t_1 between "kyou" and "wa," if time t_{1-1},
which precedes the correct time t_1, is selected as a
starting point in preliminary selection for the word "wa"
following the word "kyou," not only the feature amount of
the word "wa" but also the last portion of the feature
amount of the word "kyou" affects the preliminary
selection. If time t_{1+1}, which follows the correct time t_1, is
selected as a starting point in preliminary selection for the
word "wa," the beginning portion of the feature amount
of the word "wa" is not used in the preliminary selection.
[0034] In either case, if a starting point is erroneously
selected, an adverse effect is given to preliminary
selection and then to matching processing performed
thereafter.



[0035] In Fig. 2 (also in Fig. 5 and Fig. 7, described
later), time passes in a direction from the left to the right.
The starting time of a voice zone is set to 0, and the 
ending time is set to time T. 

[0036] In the dynamic programming method, de- 
scribed above, since final word boundaries cannot be 
determined until word scores (acoustics scores and lan- 
guage scores) have been calculated to the end of a fea- 
ture-amount series, that is, to the ending time T of the 
voice zone in Fig. 2, it is difficult to uniquely determine
word boundaries which serve as starting points in
preliminary selection when the preliminary selection is
performed.

[0037] To solve this issue, a technique has been pro- 
posed in which candidates for word boundaries are held 
until word scores have been calculated by using a fea- 
ture-amount series in a voice zone. 
[0038] In this technique, when a word score is calculated
for the word "kyou" with the starting time 0 of the
voice zone being used as a start point, and times t_{1-1},
t_1, and t_{1+1} are obtained as candidates for the ending
point of the utterance of the word "kyou," for example,
these three times t_{1-1}, t_1, and t_{1+1} are held and
preliminary selection for the next word is executed with each
of these times being used as a starting point.
[0039] In the preliminary selection, it is assumed that,
when the time t_{1-1} is used as a starting point, two words
"wa" and "ii" are obtained; when the time t_1 is used as
a starting point, one word "wa" is obtained; and when
the time t_{1+1} is used as a starting point, two words "wa"
and "ii" are obtained. It is also assumed that a word
score is calculated for each of these words and results
shown in Fig. 2(B) to Fig. 2(G) are obtained.
[0040] Specifically, Fig. 2(B) shows that a word score
is calculated for the word "wa" with the time t_{1-1} being
used as a starting point and time t_2 is obtained as a
candidate for an ending point. Fig. 2(C) shows that a word
score is calculated for the word "ii" with the time t_{1-1}
being used as a starting point and time t_{2+1} is obtained as
a candidate for an ending point. Fig. 2(D) shows that a
word score is calculated for the word "wa" with the time
t_1 being used as a starting point and time t_{2+1} is obtained
as a candidate for an ending point. Fig. 2(E) shows that
a word score is calculated for the word "wa" with the
time t_1 being used as a starting point and time t_2 is
obtained as a candidate for an ending point. Fig. 2(F)
shows that a word score is calculated for the word "wa"
with the time t_{1+1} being used as a starting point and time
t_2 is obtained as a candidate for an ending point. Fig. 2(G)
shows that a word score is calculated for the word
"ii" with the time t_{1+1} being used as a starting point and
time t_{2+2} is obtained as a candidate for an ending point.
In Fig. 2, t_{1-1} < t_1 < t_{1+1} < t_2 < t_{2+1} < t_{2+2}.
[0041] Among Fig. 2(B) to Fig. 2(G), Fig. 2(B), Fig. 2(E),
and Fig. 2(F) show that the same word string, "kyou"
and "wa," is obtained as a candidate for a result of
voice recognition, and that the ending point of the last
word "wa" of the word string is at the time t_2. Therefore,
it is possible that the most appropriate case is selected
among them, for example, according to the accumulated
values of the word scores obtained up to the time t_2,
and the remaining cases are discarded.
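
The selection among cases that share a word string and an ending point can be sketched as below. The function name and the accumulated scores are hypothetical. The ending time given to case D is also an assumption made for the example; the sketch only requires that it differ from t_2.

```python
def merge_hypotheses(hypotheses):
    """Keep the best-scoring hypothesis per (word string, ending time)."""
    best = {}
    for hyp in hypotheses:
        key = (hyp["words"], hyp["end"])
        if key not in best or hyp["score"] > best[key]["score"]:
            best[key] = hyp
    return list(best.values())

# Cases of Fig. 2(B) to Fig. 2(G); all scores are invented for illustration.
HYPOTHESES = [
    {"case": "B", "words": ("kyou", "wa"), "end": "t2", "score": -12.0},
    {"case": "C", "words": ("kyou", "ii"), "end": "t2+1", "score": -15.0},
    {"case": "D", "words": ("kyou", "wa"), "end": "t2+1", "score": -14.0},  # end time assumed
    {"case": "E", "words": ("kyou", "wa"), "end": "t2", "score": -10.0},
    {"case": "F", "words": ("kyou", "wa"), "end": "t2", "score": -11.0},
    {"case": "G", "words": ("kyou", "ii"), "end": "t2+2", "score": -16.0},
]
```

With these scores, cases B, E, and F collapse into case E, and four hypotheses remain to be carried forward.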
[0042] At the current point of time, however, a correct
case cannot be identified among a case selected from
those shown in Fig. 2(B), Fig. 2(E), and Fig. 2(F), plus
cases shown in Fig. 2(C), Fig. 2(D), and Fig. 2(G).
Therefore, these four cases need to be held. Preliminary
selection is again executed for these four cases.
[0043] Therefore, in this technique, word scores need
to be calculated while many word-boundary candidates
are held until word-score calculation using a
feature-amount series in a voice zone is finished. This is not
preferred in terms of an efficient use of the amount of
calculation and the memory capacity.
[0044] Also in this case, when truly correct word
boundaries are held as candidates for word boundaries,
the same correct word boundaries are finally obtained
in principle as those obtained in a case in which the
above-described dynamic programming technique is
used. If a truly correct word boundary is not held as a
candidate for a word boundary, a word having the word
boundary as its starting point or as its ending point is
erroneously recognized, and in addition, due to this
erroneous recognition, a word following the word may be
erroneously recognized.

[0045] In recent years, acoustic models which depend
on (consider) contexts have been used. Acoustic models
depending on contexts are acoustic models in which even
the same syllable (or phoneme) is modeled as different
models according to the syllable disposed immediately
before or immediately after it. Therefore, for example,
a syllable "a" is modeled by different
acoustic models between cases in which the syllable
disposed immediately before or immediately after is "ka"
and "sa."

[0046] Acoustic models depending on contexts are divided
into those depending on contexts within words
and those depending on contexts which extend over
words.

[0047] In a case in which acoustic models depending
on contexts within words are used, when a word model
"kyou" is generated by coupling acoustic models "kyo"
and "u," an acoustic model "kyo" depending on the
syllable "u" coming immediately thereafter (acoustic model
"kyo" with the syllable "u" coming immediately thereafter
being considered) is used, or an acoustic model "u"
depending on the syllable "kyo" coming immediately
therebefore is used.
[0048] In a case in which acoustic models depending
on contexts which extend over words are used, when a
word model "kyou" is generated by coupling acoustic
models "kyo" and "u," if the word coming immediately
thereafter is "wa," an acoustic model "u" depending on
the first syllable "wa" of the word coming immediately
thereafter is used. Acoustic models depending on contexts
which extend over words are called cross-word models.
[0049] When cross-word models are applied to voice
recognition which performs preliminary selection, a
relationship with a word disposed immediately before a
preliminarily selected word can be taken into account, 
but a relationship with a word disposed immediately af- 
ter the preliminarily selected word cannot be considered 
because the word coming immediately thereafter is not 
yet determined. 

[0050] To solve this problem, a method has been developed
in which a word which is highly likely to be disposed
immediately after a preliminarily selected word is
obtained in advance, and a word model is created with
the relationship with the obtained word taken into
account. More specifically, for example, when words "wa,"
"ga," and "no" are highly likely to be disposed immediately
after the word "kyou," the word model is generated
by using acoustic models "u" depending on "wa," "ga,"
and "no," which correspond to the last syllable of word
models for the word "kyou."

[0051] Since unnecessary contexts are always taken 
into account, however, this method is not desirable in 
terms of an efficient use of the amount of calculation and 
the memory capacity. 

[0052] For the same reason, it is difficult to calculate 
the language score of a preliminarily selected word with 
the word disposed immediately thereafter being taken 
into account. 

[0053] As a voice recognition method in which not on- 
ly a word preceding an aimed-at word but also a word 
following the aimed-at word are taken into account, 
there has been proposed a two-pass decoding method, 
described, for example, in "The N-Best Algorithm: An
Efficient and Exact Procedure for Finding The Most
Likely Sentence Hypotheses," Proc. ICASSP, pp. 81-84,
1990, written by R. Schwartz and Y. L. Chow.
[0054] Fig. 3 shows an outlined structure of a conven- 
tional voice recognition apparatus which executes voice 
recognition by the two-pass decoding method. 
[0055] In Fig. 3, a matching section 4_1 performs, for
example, the same matching processing as the matching
section 4 shown in Fig. 1, and outputs a word string
obtained as the result of the processing. The matching
section 4_1 does not output only one word string serving
as the final voice-recognition result among a plurality of
word strings obtained as the results of the matching
processing, but outputs a plurality of likely word strings
as candidates for voice-recognition results.
[0056] The outputs of the matching section 4_1 are sent
to a matching section 4_2. The matching section 4_2
performs matching processing for re-evaluating the
probability of determining each word string among the plurality
of word strings output from the matching section 4_1 as
the voice-recognition result. In a word string output from
the matching section 4_1 as a voice-recognition result,
since a word has not only a word disposed immediately
therebefore but also a word disposed immediately
thereafter, the matching section 4_2 uses cross-word
models to obtain a new acoustics score and a new language
score with not only the word disposed immediately
therebefore but also the word disposed immediately
thereafter being taken into account. The matching
section 4_2 determines and outputs a likely word string
as the voice-recognition result according to the new
acoustics score and language score of each word string
among the plurality of word strings output from the
matching section 4_1.

[0057] In the two-pass decoding, described above,
generally, simple acoustic models, a word dictionary,
and a grammar rule which do not have high precision
are used in the matching section 4_1, which performs first
matching processing, and acoustic models, a word
dictionary, and a grammar rule which have high precision
are used in the matching section 4_2, which performs
subsequent matching processing. With this configuration,
in the voice recognition apparatus shown in Fig. 3,
the amounts of processing performed in the matching
sections 4_1 and 4_2 are both reduced and a highly precise
voice-recognition result is obtained.
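
The division of work between the two matching sections can be sketched as a coarse N-best pass followed by a precise re-evaluation. This is a schematic illustration only: the function names, the candidate word strings, and the score tables `COARSE` and `PRECISE` are hypothetical, and real matching sections compute their scores from feature amounts rather than from lookup tables.

```python
import heapq

def first_pass(candidates, coarse_score, n_best):
    """First matching: keep the n_best word strings under the simple models."""
    return heapq.nlargest(n_best, candidates, key=coarse_score)

def second_pass(n_best_list, precise_score):
    """Second matching: re-evaluate only the survivors with precise models."""
    return max(n_best_list, key=precise_score)

# Hypothetical candidate word strings and scores.
C0 = ("kyou", "wa", "ii")
C1 = ("kyou", "ga", "ii")
C2 = ("kyo", "wa", "ii")
C3 = ("kyou", "wa", "e")
CANDIDATES = [C0, C1, C2, C3]
COARSE = {C0: -10.0, C1: -9.0, C2: -12.0, C3: -20.0}
PRECISE = {C0: -8.0, C1: -11.0, C2: -13.0, C3: -5.0}
```

Here `first_pass(CANDIDATES, COARSE.get, 3)` keeps C1, C0, and C2, and `second_pass` then selects C0. The precise models are applied to three word strings instead of four, which is where the saving comes from. C3 is never re-evaluated even though the precise table would favor it, which illustrates that a string dropped by the first pass cannot be recovered by the second.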
[0058] Fig. 3 shows a two-pass-decoding voice rec- 
ognition apparatus, as described above. There has also 
been proposed a voice-recognition apparatus which 
performs multi-pass decoding, in which the same 
matching sections are added after the matching section 
4_2 shown in Fig. 3.

[0059] In two-pass decoding and multi-pass decod- 
ing, however, until the first matching processing has 
been finished, the next matching processing cannot be 
achieved. Therefore, a delay time measured from when 
a voice is input to when the final voice-recognition result
is output becomes long. 

[0060] To solve this problem, there has been pro- 
posed a method in which, when first matching process- 
ing has been finished for several words, subsequent 
matching processing is performed for the several words 
with cross-word models being used, and this operation 
is repeated for other words. The method is described, 
for example, in "Evaluation of a Stack Decoder on a Japanese
Newspaper Dictation Task," Onkoron, 1-R-12,
pp. 141-142, 1997, written by M. Schuster.
[0061] Preliminary selection is generally performed 
by using simple acoustic models and a grammar rule 
which do not have high precision. Since preliminary se- 
lection is applied to all words stored in the word diction- 
ary, when preliminary selection is performed with highly 
precise acoustic models and a highly precise grammar 
rule, a large amount of resources, such as the amount 
of calculation and a memory capacity, is required to maintain
real-time operation. Therefore, with the use of simple
acoustic models and a simple grammar rule, preliminary 
selection is executed at a high speed with relatively 
smaller resources even for a large vocabulary. 
[0062] In preliminary selection, however, after matching
processing is performed for a word by using a
feature-amount series and a likely ending point is obtained,
the ending point is set to a starting point and matching
processing is again performed by using a feature-amount
series from the time corresponding to the starting
point. In other words, preliminary selection is performed
when boundaries (word boundaries) between
words included in a voice continuously uttered have not
yet been finally determined.

[0063] Therefore, if the starting point and the ending
point of a feature-amount series used in preliminary
selection are shifted from the starting point and the ending
point of the corresponding word, preliminary selection
is performed by using a feature-amount series including
the feature amount of a phoneme included in a word
disposed immediately before the corresponding word or a
word disposed immediately after the corresponding
word, or by using a feature-amount series in which the
feature amount of the beginning or last portion of the
corresponding word is missing, that is, by using a
feature-amount series which is acoustically not stable.
[0064] Therefore, in preliminary selection using simple
acoustic models, it may happen that a word included
in an utterance is not selected. If a correct word is not
selected in preliminary selection, since matching
processing is not performed for the word, an erroneous
voice-recognition result is obtained.
[0065] To solve this problem, for preliminary selection,
there have been proposed a method for widening an
acoustic or linguistic determination reference used for
selecting a word to increase the number of selected
words, and a method in which highly precise acoustic
models and a highly precise grammar rule are used.

[0066] When an acoustic or linguistic determination
reference used for selecting a word is widened in
preliminary selection, however, matching processing is
applied to many words which are not likely to be
voice-recognition results, and an increasing amount of
resources, such as the amount of calculation and a memory
capacity, is required for matching processing, which
has a heavier load per word than preliminary selection.
[0067] When highly precise acoustic models and a
highly precise grammar rule are used in preliminary
selection, an increasing amount of resources is required
for preliminary selection.

[0068] Various aspects and features of the present in- 
vention are defined in the appended claims. 
[0069] In one aspect of the present invention there is
provided a voice recognition apparatus for calculating a
score indicating the likelihood of a result of voice
recognition applied to an input voice and for recognizing the
voice according to the score, including selecting means
for selecting one or more words following words which
have been obtained in a word string serving as a
candidate for a result of the voice recognition, from a group
of words to which voice recognition is applied; forming
means for calculating the scores for the words selected
by the selecting means, and for forming a word string
serving as a candidate for a result of the voice
recognition according to the scores; storage means for storing
word-connection relationships between words in the
word string serving as a candidate for a result of the
voice recognition; correction means for correcting the
word-connection relationships; and determination
means for determining a word string serving as the
result of the voice recognition according to the corrected
word-connection relationships.

[0070] Embodiments of the present invention relate to 
voice recognition apparatuses, voice recognition meth- 
ods, and recording media, and more particularly, to a 
voice recognition apparatus, a voice recognition meth- 
od, and a recording medium which allow the precision 
of voice recognition to be improved. 
[0071] The present invention has been made in con- 
sideration of the above conditions. Embodiments of the 
present invention can perform highly precise or at least 
improved precision voice recognition while an increase 
of resources required for processing is suppressed or 
at least reduced. 

[0072] The storage means may store the connection 
relationships by using a graph structure expressed by a 
node and an arc. 

[0073] The storage means may store nodes which 
can be shared as one node. 

[0074] The storage means may store the acoustic 
score and the linguistic score of each word, and the 
starting time and the ending time of the utterance cor- 
responding to each word, together with the connection 
relationships between words. 

[0075] The voice recognition apparatus may be con- 
figured such that the forming means forms a word string 
serving as a candidate for a result of the voice recogni- 
tion by connecting the words for which the scores are 
calculated to a word for which a score has been calcu- 
lated, and the correction means sequentially corrects 
the connection relationships every time a word is con- 
nected by the forming means. 

[0076] The selecting means or the forming means 
may perform processing while referring to the connec- 
tion relationships. 

[0077] The selecting means, the forming means, or 
the correction means may calculate an acoustic or lin- 
guistic score for a word, and perform processing accord- 
ing to the acoustic or linguistic score. 
[0078] The selecting means, the forming means, or 
the correction means may calculate an acoustic or lin- 
guistic score for each word independently. 
[0079] The selecting means, the forming means, or 
the correction means may calculate an acoustic or lin- 
guistic score for each word independently in terms of 
time. 

[0080] The correction means may calculate an acous-
tic or linguistic score for a word by referring to the con-
nection relationships, with a word disposed before or af-
ter the word for which a score is to be calculated being
taken into account.

[0081] In another aspect of the present invention
there is provided a voice recognition method for calcu-
lating a score indicating the likelihood of a result of voice
recognition applied to an input voice and for recognizing
the voice according to the score, including a selecting
step of selecting one or more words following words
which have been obtained in a word string serving as a
candidate for a result of the voice recognition, from a
group of words to which voice recognition is applied; a
forming step of calculating the scores for the words se-
lected in the selecting step, and of forming a word string
serving as a candidate for a result of the voice recogni-
tion according to the scores; a correction step of correct-
ing word-connection relationships between words in the
word string serving as a candidate for a result of the
voice recognition, the word-connection relationships be-
ing stored in storage means; and a determination step
of determining a word string serving as the result of the
voice recognition according to the corrected word-con-
nection relationships.

[0082] In another aspect of the present invention
there is provided a recording medium storing a program
which makes a computer execute voice-recognition
processing for calculating a score indicating the likeli-
hood of a result of voice recognition applied to an input
voice and for recognizing the voice according to the
score, the program including a selecting step of select-
ing one or more words following words which have been
obtained in a word string serving as a candidate for a
result of the voice recognition, from a group of words to
which voice recognition is applied; a forming step of cal-
culating the scores for the words selected in the select-
ing step, and of forming a word string serving as a can-
didate for a result of the voice recognition according to
the scores; a correction step of correcting word-connec-
tion relationships between words in the word string serv-
ing as a candidate for a result of the voice recognition,
the word-connection relationships being stored in stor-
age means; and a determination step of determining a
word string serving as the result of the voice recognition
according to the corrected word-connection relation-
ships.

[0083] The invention will now be described by way of
example with reference to the accompanying drawings,
throughout which like parts are referred to by like refer-
ences, and in which:

Fig. 1 is a block diagram of a conventional voice
recognition apparatus.
Fig. 2 is a view showing a reason why candidates
for boundaries between words need to be held.
Fig. 3 is a block diagram of another conventional
voice recognition apparatus.
Fig. 4 is a block diagram of a voice recognition ap-
paratus according to an embodiment of the present
invention.
Fig. 5 is a view showing word-connection informa-
tion.
Fig. 6 is a flowchart of processing executed by the
voice recognition apparatus shown in Fig. 4.
Fig. 7 is a view showing processing executed by a
re-evaluation section 15.
Fig. 8 is a block diagram of a computer according
to another embodiment of the present invention.

[0084] Fig. 4 shows an example structure of a voice
recognition apparatus according to an embodiment of
the present invention. In Fig. 4, the same symbols as
those used in Fig. 1 are assigned to the portions corre-
sponding to those shown in Fig. 1, and a description
thereof will be omitted.

[0085] Series of feature amounts of the voice uttered
by the user, output from a feature extracting section 3,
are sent to a control section 11 in units of frames. The
control section 11 sends the feature amounts sent from
the feature extracting section 3 to a feature-amount
storage section 12.

[0086] The control section 11 controls a matching sec-
tion 14 and a re-evaluation section 15 by referring to
word-connection information stored in a word-connec-
tion-information storage section 16. The control section
11 also generates word-connection information accord-
ing to acoustics scores and language scores obtained
in the matching section 14 as the results of the same
matching processing as that performed in the matching
section 4 shown in Fig. 1, and, by that word-connection
information, updates the storage contents of the word-
connection-information storage section 16. The control
section 11 further corrects the storage contents of the
word-connection-information storage section 16 ac-
cording to the output of the re-evaluation section 15. In
addition, the control section 11 determines and outputs
the final result of voice recognition according to the
word-connection information stored in the word-connec-
tion-information storage section 16.
[0087] The feature-amount storage section 12 stores
series of feature amounts sent from the control section
11 until, for example, the result of user's voice recogni-
tion is obtained. The control section 11 sends a time
(hereinafter called an extracting time, if necessary)
when a feature amount output from the feature extract-
ing section 3 is obtained with the starting time of a voice
zone being set to a reference (for example, zero), to the
feature-amount storage section 12 together with the fea-
ture amount. The feature-amount storage section 12
stores the feature amount together with the extracting
time. The feature amount and the extracting time stored
in the feature-amount storage section 12 can be referred
to, if necessary, by a preliminary word-selecting section
13, the matching section 14, and the re-evaluation sec-
tion 15.
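
By way of illustration only, the storage of feature amounts together with their extracting times might be sketched as follows. The class name `FeatureStore`, its methods, and the half-open time interval are assumptions made for this sketch and do not appear in the embodiment.

```python
class FeatureStore:
    """Hypothetical stand-in for the feature-amount storage section 12."""

    def __init__(self):
        # Each entry pairs an extracting time (voice-zone start = 0)
        # with the feature amount obtained at that time.
        self._frames = []

    def append(self, extracting_time, feature):
        self._frames.append((extracting_time, feature))

    def series(self, start, end):
        """Return the feature amounts with start <= extracting time < end."""
        return [f for t, f in self._frames if start <= t < end]


store = FeatureStore()
for t in range(5):
    store.append(t, [float(t)])  # toy one-dimensional feature amounts
```

A later stage can then request only the feature-amount series for the period it needs, for example `store.series(1, 3)`.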

[0088] In response to a request from the matching
section 14, the preliminary word-selecting section 13
performs preliminary word-selecting processing for se-
lecting one or more words to which the matching section
14 applies matching processing, with the use of the fea-
ture amounts stored in the feature-amount storage sec-
tion 12 by referring to the word-connection-information
storage section 16, an acoustic-model data base 17A,
a dictionary data base 18A, and a grammar data base
19A, if necessary.
[0089] Under the control of the control section 11, the
matching section 14 applies matching processing to the
words obtained by the preliminary word-selecting
processing in the preliminary word-selecting section 13,
with the use of the feature amounts stored in the feature-
amount storage section 12 by referring to the word-con-
nection-information storage section 16, an acoustic-
model data base 17B, a dictionary data base 18B, and
a grammar data base 19B, if necessary, and sends the
result of matching processing to the control section 11.
[0090] Under the control of the control section 11, the
re-evaluation section 15 re-evaluates the word-connec-
tion information stored in the word-connection-informa-
tion storage section 16, with the use of the feature
amounts stored in the feature-amount storage section
12 by referring to an acoustic-model data base 17C, a
dictionary data base 18C, and a grammar data base
19C, if necessary, and sends the result of re-evaluation
to the control section 11.

[0091] The word-connection-information storage sec-
tion 16 stores the word-connection information sent
from the control section 11 until the result of user's voice
recognition is obtained.

[0092] The word-connection information indicates
connection (chaining or linking) relationships between
words which constitute word strings serving as candi-
dates for the final result of voice recognition, and in-
cludes the acoustics score and the language score of
each word and the starting time and the ending time of
the utterance corresponding to each word.
[0093] Fig. 5 shows the word-connection information
stored in the word-connection-information storage sec-
tion 16 by using a graph structure.

[0094] In the embodiment shown in Fig. 5, the graph
structure indicating the word-connection information is
formed of arcs (portions indicated by segments connect-
ing marks O in Fig. 5) indicating words and nodes (por-
tions indicated by marks O in Fig. 5) indicating bounda-
ries between words.

[0095] Nodes have time information which indicates
the extracting time of the feature amounts correspond-
ing to the nodes. As described above, an extracting time
shows a time when a feature amount output from the
feature extracting section 3 is obtained with the starting
time of a voice zone being set to zero. Therefore, in Fig.
5, the start of a voice zone, namely, the time information
which the node Node 1 corresponding to the beginning
of a first word has is zero. Nodes can be the starting
ends and the ending ends of arcs. The time information
which nodes (starting-end nodes) serving as starting
ends have or the time information which nodes (ending-
end nodes) serving as ending ends have are the starting
time or the ending time of the utterances of the words
corresponding to the nodes, respectively.

[0096] In Fig. 5, time passes in the direction from the
left to the right. Therefore, between nodes disposed at
the left and right of an arc, the left-hand node serves as
the starting-end node and the right-hand node serves
as the ending-end node.

[0097] Arcs have the acoustics scores and the lan-
guage scores of the words corresponding to the arcs.
Arcs are sequentially connected by setting an ending
node to a starting node to form a series of words serving
as a candidate for the result of voice recognition.
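
By way of illustration only, the graph structure described above (nodes carrying time information, arcs carrying a word with its acoustics score and language score) might be sketched as follows. The class names, field names, score values, and times are hypothetical and are not taken from Fig. 5.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A boundary between words; `time` is the extracting time."""
    time: int
    out_arcs: list = field(default_factory=list)


@dataclass(eq=False)
class Arc:
    """One word hypothesis connecting a starting-end node to an ending-end node."""
    word: str
    acoustic_score: float
    language_score: float
    start: Node
    end: Node


def connect(start, word, acoustic, language, end_time):
    """Attach a new arc to `start`, creating its ending-end node."""
    arc = Arc(word, acoustic, language, start, Node(end_time))
    start.out_arcs.append(arc)
    return arc


# The ending-end node of one arc becomes the starting-end node of the next.
initial = Node(time=0)  # start of the voice zone
kyou = connect(initial, "kyou", -12.0, -1.5, end_time=30)
wa = connect(kyou.end, "wa", -5.0, -0.7, end_time=42)
```

In this sketch the starting time and ending time of each word are read from the nodes at either end of its arc, so correcting a word boundary amounts to changing a single node's time.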
[0098] More specifically, the control section 11 first
connects the arcs corresponding to words which are
likely to serve as the results of voice recognition to the
node Node 1 indicating the start of a voice zone. In the
embodiment shown in Fig. 5, an arc Arc 1 corresponding
to "kyou," an arc Arc 6 corresponding to "ii," and an arc
Arc 11 corresponding to "tenki" are connected to the node
Node 1. It is determined according to acoustics scores
and language scores obtained by the matching section
14 whether words are likely to serve as the results of
voice recognition.

[0099] Then, in the same way, the arcs corresponding
to likely words are connected to a node Node 2 serving
as the ending end of the arc Arc 1 corresponding to "kyou,"
to an ending node Node 7 serving as the ending end of
the arc Arc 6 corresponding to "ii," and to a node Node 12
serving as the ending end of the arc Arc 11 corresponding
to "tenki."
[0100] Arcs are connected as described above to
form one or more passes formed of arcs and nodes in
the direction from the left to the right with the start of the
voice zone being used as a starting point. When all
passes reach the end (time T in the embodiment shown
in Fig. 5) of the voice zone, for example, the control sec-
tion 11 accumulates the acoustics scores and the lan-
guage scores which arcs constituting each pass formed
from the start to the end of the voice zone have, to obtain
final scores. The series of words corresponding to the
arcs constituting the pass which has the highest final
score is determined to be the result of voice recognition
and output.
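
By way of illustration only, the accumulation of scores along each pass and the selection of the pass having the highest final score might be sketched as follows. The node identifiers, times, and scores are hypothetical toy values, and the exhaustive enumeration of passes is an assumption of this sketch rather than the method of the embodiment.

```python
def best_pass(arcs, node_time, start, end_time):
    """Return (final score, words) for the highest-scoring pass that
    runs from `start` to a node at the end of the voice zone."""
    best = None
    stack = [(start, 0.0, [])]
    while stack:
        node, score, words = stack.pop()
        if node_time[node] == end_time:
            # This pass has reached the end of the voice zone.
            if best is None or score > best[0]:
                best = (score, words)
            continue
        for word, acoustic, language, nxt in arcs.get(node, []):
            stack.append((nxt, score + acoustic + language, words + [word]))
    return best


# Toy word-connection information: node -> [(word, acoustic, language, next node)]
arcs = {
    "a": [("kyou", -10.0, -1.0, "b"), ("ii", -14.0, -2.0, "d")],
    "b": [("wa", -4.0, -1.0, "c")],
    "c": [("ii", -6.0, -1.0, "e")],
    "d": [("tenki", -20.0, -3.0, "e")],
}
node_time = {"a": 0, "b": 30, "c": 42, "d": 35, "e": 60}
result = best_pass(arcs, node_time, "a", end_time=60)
```

Here the pass "kyou", "wa", "ii" accumulates -23.0, against -39.0 for "ii", "tenki", so the former is returned.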

[0101] Specifically, in Fig. 5, when the highest final
score is obtained for a pass formed of the node Node 1,
the arc Arc 1 corresponding to "kyou," the node Node 2,
the arc Arc 2 corresponding to "wa," a node Node 3, an
arc Arc 3 corresponding to "ii," a node Node 4, an arc Arc 4
corresponding to "tenki," a node Node 5, an arc Arc 5 cor-
responding to "desune," and a node Node 6, for exam-
ple, a series of words, "kyou," "wa," "ii," "tenki," and
"desune," is output as the result of voice recognition.
[0102] In the above case, arcs are always connected
to nodes disposed within the voice zone to form a pass
extending from the start to the end of the voice zone.
During a process for forming such a pass, it is possible
that, when it is clear from a score for a pass which has
been made so far that the pass is inappropriate as the
result of voice recognition, forming the pass is stopped
(an arc is not connected any more).
[0103] According to the above pass forming rule, the
ending end of one arc serves as the starting-end node
of one or more arcs to be connected next, and passes
are basically formed as branches and leaves spread.
There is an exceptional case in which the ending end of
one arc matches the ending end of another arc, namely,
the ending-end node of an arc and the ending end of
another arc are used as an identical node in common.
[0104] When bigram is used as a grammar rule, if two
arcs extending from different nodes correspond to an
identical word, and the same ending time of the utter-
ance of the word is used, the ending ends of the two
arcs match.

[0105] In Fig. 5, since an arc Arc 7 extending from a node
Node 7 used as a starting end and an arc Arc 13 extending
from a node Node 13 used as a starting end both cor-
respond to "tenki," and the same ending time of the ut-
terance is used, the ending nodes thereof are used as
an identical node Node 8 in common.
[0106] It is also possible that nodes are never
used in common. From the viewpoint of the efficient use of
a memory capacity, however, it is preferred that two ending
nodes be allowed to match.

[0107] In Fig. 5, bigram is used as a grammar rule. 
Even when other rules, such as trigram, are used, it is 
possible to use nodes in common. 
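
By way of illustration only, the sharing of ending-end nodes under a bigram grammar rule might be sketched as follows. With bigram, the language score of a following word depends only on the immediately preceding word, so two arcs for an identical word with the same ending time can end at one node. The class name `NodePool` and its members are hypothetical.

```python
class NodePool:
    """Hands out one shared ending-end node per (word, ending time) pair."""

    def __init__(self):
        self._shared = {}
        self.created = 0  # number of distinct nodes actually stored

    def ending_node(self, word, end_time):
        key = (word, end_time)
        if key not in self._shared:
            self._shared[key] = {"id": self.created, "time": end_time}
            self.created += 1
        return self._shared[key]


pool = NodePool()
first = pool.ending_node("tenki", 80)   # arc from one starting-end node
second = pool.ending_node("tenki", 80)  # arc from a different starting-end node
other = pool.ending_node("tenki", 85)   # different ending time: not shared
```

Under trigram, the key would presumably also need the preceding word, since the language score of the next word then depends on two words of history; that extension is an assumption and is not shown.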
[0108] The preliminary word-selecting section 13, the
matching section 14, and the re-evaluation section 15
can refer to the word-connection information stored in
the word-connection-information storage section 16, if
necessary.

[0109] Back to Fig. 4, the acoustic-model data bases
17A, 17B, and 17C basically store acoustic models such
as those stored in the acoustic-model data base 5
shown in Fig. 1, described before.
[0110] The acoustic-model data base 17B stores
highly precise acoustic models to which more precise
processing can be applied than that applied to acoustic
models stored in the acoustic-model data base 17A. The
acoustic-model data base 17C stores highly precise
acoustic models to which more precise processing can
be applied than that applied to the acoustic models
stored in the acoustic-model data base 17B. More spe-
cifically, when the acoustic-model data base 17A stores,
for example, one-pattern acoustic models which do not
depend on the context for each phoneme and syllable,
the acoustic-model data base 17B stores, for example,
acoustic models which depend on the context extending
over words, namely cross-word models, as well as
acoustic models which do not depend on the context for
each phoneme and syllable. In this case, the acoustic-
model data base 17C stores, for example, acoustic
models depending on the context within words in addi-
tion to acoustic models which do not depend on the con-
text and cross-word models.

[0111] The dictionary data bases 18A, 18B, and 18C
basically store a word dictionary such as that stored in
the dictionary data base 6 shown in Fig. 1, described
above.

[0112] Specifically, the same set of words is stored in
the word dictionaries of the dictionary data bases 18A
to 18C. The word dictionary of the dictionary data base
18B stores highly precise phoneme information to which
more precise processing can be applied than that ap-
plied to phoneme information stored in the word diction-
ary of the dictionary data base 18A. The word dictionary
of the dictionary data base 18C stores highly precise
phoneme information to which more precise processing
can be applied than that applied to the phoneme infor-
mation stored in the word dictionary of the dictionary da-
ta base 18B. More specifically, when only one piece of
phoneme information (reading) is stored for each word
in the word dictionary of the dictionary data base 18A,
for example, a plurality of pieces of phoneme informa-
tion is stored for each word in the word dictionary of the
dictionary data base 18B. In this case, for example,
more pieces of phoneme information are stored for each
word in the word dictionary of the dictionary data base
18C.

[0113] Concretely, for example, for the word "ohayou,"
one piece of phoneme information, "ohayou," is stored
in the word dictionary of the dictionary data base 18A,
"ohayoo" and "ohayo" as well as "ohayou" are stored as
phoneme information in the word dictionary of the dic-
tionary data base 18B, and "hayou" and "hayoo" in ad-
dition to "ohayou," "ohayoo," and "ohayo" are stored as
phoneme information in the word dictionary of the dic-
tionary data base 18C.

[0114] The grammar data bases 19A, 19B, and 19C
basically store a grammar rule such as that stored in the
grammar data base 7 shown in Fig. 1, described above.
[0115] The grammar data base 19B stores a highly
precise grammar rule to which more precise processing
can be applied than that applied to a grammar rule
stored in the grammar data base 19A. The grammar da-
ta base 19C stores a highly precise grammar rule to
which more precise processing can be applied than that
applied to the grammar rule stored in the grammar data
base 19B. More specifically, when the grammar data
base 19A stores, for example, a grammar rule based on
unigram (occurrence probabilities of words), the gram-
mar data base 19B stores, for example, bigram (occur-
rence probabilities of words with a relationship with
words disposed immediately therebefore being taken in-
to account). In this case, the grammar data base 19C
stores, for example, a grammar rule based on trigram
(occurrence probabilities of words with relationships
with words disposed immediately therebefore and
words disposed one more word before being taken into
account) and a context-free grammar.
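
By way of illustration only, the difference between unigram, bigram, and trigram language scores might be sketched as follows, using unsmoothed relative frequencies counted from a two-sentence toy corpus. The corpus, the function names, and the use of log probabilities as scores are assumptions of this sketch; a practical grammar data base would need smoothing, since an unseen word chain here raises an error.

```python
import math
from collections import Counter

# Hypothetical toy corpus; not data from the embodiment.
corpus = [
    ["kyou", "wa", "ii", "tenki", "desune"],
    ["kyou", "wa", "ii", "hi", "desune"],
]


def ngram_counts(sentences, n):
    """Count every run of n consecutive words."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[tuple(s[i:i + n])] += 1
    return counts


uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))
total_words = sum(uni.values())


def unigram_score(w):
    """Occurrence probability of the word alone."""
    return math.log(uni[(w,)] / total_words)


def bigram_score(prev, w):
    """Probability of w given the word immediately before it."""
    return math.log(bi[(prev, w)] / uni[(prev,)])


def trigram_score(prev2, prev, w):
    """Probability of w given the two words before it."""
    return math.log(tri[(prev2, prev, w)] / bi[(prev2, prev)])
```

For example, `trigram_score("kyou", "wa", "ii")` uses the count of the chain "kyou", "wa", "ii" divided by the count of "kyou", "wa", whereas `unigram_score("ii")` ignores the preceding words entirely.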
[0116] As described above, the acoustic-model data
base 17A stores one-pattern acoustic models for each
phoneme and syllable, the acoustic-model data base
17B stores plural-pattern acoustic models for each pho-
neme and syllable, and the acoustic-model data base
17C stores more-pattern acoustic models for each pho-
neme and syllable. The dictionary data base 18A stores
one piece of phoneme information for each word, the
dictionary data base 18B stores a plurality of pieces of
phoneme information for each word, and the dictionary
data base 18C stores more pieces of phoneme informa-
tion for each word. The grammar data base 19A stores
a simple grammar rule, the grammar data base 19B
stores a highly precise grammar rule, and the grammar
data base 19C stores a more highly precise grammar
rule.

[0117] The preliminary word-selecting section 13,
which refers to the acoustic-model data base 17A, the
dictionary data base 18A, and the grammar data base
19A, obtains acoustics scores and language scores
quickly for many words although precision is not high.
The matching section 14, which refers to the acoustic-
model data base 17B, the dictionary data base 18B, and
the grammar data base 19B, obtains acoustics scores
and language scores quickly for a certain number of
words with high precision. The re-evaluation section 15,
which refers to the acoustic-model data base 17C, the
dictionary data base 18C, and the grammar data base
19C, obtains acoustics scores and language scores
quickly for a few words with higher precision.
[0118] The precision of the acoustic models stored in
the acoustic-model data bases 17A to 17C differs
in the above description. The acoustic-model data bas-
es 17A to 17C can, however, store the same acoustic models. In
this case, the acoustic-model data bases 17A to 17C
can be integrated into one acoustic-model data base. In
the same way, the word dictionaries of the dictionary da-
ta bases 18A to 18C can store the same contents, and
the grammar data bases 19A to 19C can store the same
grammar rule.

[0119] Voice recognition processing executed by the 
voice recognition apparatus shown in Fig. 4 will be de- 
scribed next by referring to a flowchart shown in Fig. 6. 
[0120] When the user utters, the uttered voice is con-
verted to digital voice data through a microphone 1
and an AD conversion section 2, and is sent to the fea-
ture extracting section 3. The feature extracting section
3 sequentially extracts a voice feature amount from the
sent voice data in units of frames, and sends it to the
control section 11.

[0121] The control section 11 recognizes a voice zone
by some technique, relates a series of feature amounts
sent from the feature extracting section 3 to the extract-
ing time of each feature amount in the voice zone, and
sends them to the feature-amount storage section 12
and stores them in it.

[0122] After the voice zone starts, the control section
11 also generates a node (hereinafter called an initial
node, if necessary) indicating the start of the voice zone,
and sends it to the word-connection-information storage
section 16 and stores it in it in step S1. In other words, the
control section 11 stores the node Node 1 shown in Fig.
5 in the word-connection-information storage section 16
in step S1.

[0123] The processing proceeds to step S2. The con-
trol section 11 determines whether an intermediate node
exists by referring to the word-connection information
stored in the word-connection-information storage sec-
tion 16.

[0124] As described above, in the word-connection in-
formation shown in Fig. 5, arcs are connected to ending-
end nodes to form a pass which extends from the start
of the voice zone to the end. In step S2, among ending-
end nodes, a node to which an arc has not yet been
connected and which does not reach the end of the
voice zone is searched for as an intermediate node
(such as the nodes Node 8, Node 10, and Node 11 in Fig.
5), and it is determined whether such an intermediate
node exists.

[0125] As described above, the voice zone is recog-
nized by some technique, and the time corresponding
to an ending-end node is recognized by referring to the
time information which the ending-end node has. There-
fore, whether an ending-end node to which an arc has
not yet been connected does not reach the end of the
voice zone is determined by comparing the end time of
the voice zone with the time information which the end-
ing-end node has.
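
By way of illustration only, the search for intermediate nodes in step S2 might be sketched as follows: an intermediate node is an ending-end node to which no arc has yet been connected and whose time information lies before the end of the voice zone. The node identifiers, times, and the list-of-tuples representation are hypothetical.

```python
def intermediate_nodes(node_time, arcs, end_time):
    """Return ending-end nodes that start no arc and lie before `end_time`.

    arcs: list of (starting-end node, word, ending-end node).
    """
    ending_ends = {end for _, _, end in arcs}
    starting_ends = {start for start, _, _ in arcs}
    return sorted(
        n for n in ending_ends
        if n not in starting_ends and node_time[n] < end_time
    )


# Toy word-connection information with the voice zone ending at time 100.
node_time = {"n1": 0, "n2": 30, "n3": 42, "n4": 100}
arcs = [("n1", "kyou", "n2"), ("n1", "ii", "n3"), ("n2", "wa", "n4")]
pending = intermediate_nodes(node_time, arcs, end_time=100)
```

In this toy case only "n3" qualifies: "n2" already has an arc connected to it, and "n4" has reached the end of the voice zone. An empty result corresponds to the case in which no intermediate node exists.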

[0126] When it is determined in step S2 that an inter- 
mediate node exists, the processing proceeds to step 
S3. The control section 11 selects one node from inter- 
mediate nodes included in the word-connection infor- 
mation as a node (hereinafter called an aimed-at node, 
if necessary) for determining a word serving as an arc 
to be connected to the node. 

[0127] Specifically, when only one intermediate node
is included in the word-connection information, the con-
trol section 11 selects the intermediate node as an
aimed-at node. When a plurality of intermediate nodes
are included in the word-connection information, the
control section 11 selects one of the plurality of interme-
diate nodes as an aimed-at node. More specifically, the
control section 11 refers to the time information which
each of the plurality of intermediate nodes has, and se-
lects the node having the time information which indi-
cates the oldest time (closest to the start of the voice
zone), or the node having the time information which in-
dicates the newest time (closest to the end of the voice
zone), as an aimed-at node. Alternatively, for example,
the control section 11 accumulates the acoustics scores
and the language scores which the arcs constituting a
pass extending from the initial node to each of the plurality
of intermediate nodes have, and selects the intermedi-
ate node disposed at the ending end of the pass which
has the largest of the accumulated values (hereinafter
called partial accumulated values, if necessary) or the
smallest.
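
By way of illustration only, the alternative rules for selecting an aimed-at node from a plurality of intermediate nodes might be sketched as follows. The function name, the policy labels, and the candidate values are hypothetical.

```python
def select_aimed_at(candidates, policy="oldest"):
    """Pick one intermediate node.

    candidates: list of (node id, time information, partial accumulated value).
    """
    if policy == "oldest":      # closest to the start of the voice zone
        return min(candidates, key=lambda c: c[1])[0]
    if policy == "newest":      # closest to the end of the voice zone
        return max(candidates, key=lambda c: c[1])[0]
    if policy == "largest":     # largest partial accumulated value
        return max(candidates, key=lambda c: c[2])[0]
    if policy == "smallest":    # smallest partial accumulated value
        return min(candidates, key=lambda c: c[2])[0]
    raise ValueError(f"unknown policy: {policy}")


candidates = [("x", 70, -40.0), ("y", 55, -31.0), ("z", 62, -52.0)]
```

With these toy values, "oldest" and "largest" both select "y", "newest" selects "x", and "smallest" selects "z".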

[0128] Then, the control section 11 outputs an instruc-
tion (hereinafter called a matching processing instruc-
tion, if necessary) for performing matching processing
with the time information which the aimed-at node has
being used as a starting time, to the matching section
14 and to the re-evaluation section 15.
[0129] When the re-evaluation section 15 receives
the matching processing instruction from the control
section 11, the processing proceeds to step S4. The re-
evaluation section 15 recognizes the word string (here-
inafter called a partial word string) indicated by the arcs
constituting the pass (hereinafter called a partial pass)
extending from the initial node to the aimed-at node, by
referring to the word-connection-information storage
section 16 to re-evaluate the partial word string. The par-
tial word string is, as described later, an intermediate
result of a word string serving as a candidate for the re-
sult of voice recognition, obtained by matching process-
ing which the matching section 14 applies to words pre-
liminarily selected by the preliminary word-selecting
section 13. The re-evaluation section 15 again evalu-
ates the intermediate result.

[0130] Specifically, the re-evaluation section 15 reads
the series of feature amounts corresponding to the par-
tial word string from the feature-amount storage section
12 to recalculate a language score and an acoustics
score for the partial word string. More specifically, the
re-evaluation section 15 reads, for example, the series
(feature-amount series) of feature amounts related to
the period from the time indicated by the time informa-
tion which the initial node, the beginning node of the par-
tial pass, has to the time indicated by the time informa-
tion which the aimed-at node has, from the feature-
amount storage section 12. In addition, the re-evalua-
tion section 15 re-calculates a language score and an
acoustics score for the partial word string by referring to
the acoustic-model data base 17C, the dictionary data
base 18C, and the grammar data base 19C with the use
of the feature-amount series read from the feature-
amount storage section 12. This re-calculation is per-
formed without fixing the word boundaries of the words
constituting the partial word string. Therefore, the re-
evaluation section 15 determines the word boundaries
of the words constituting the partial word string accord-
ing to the dynamic programming method by re-calculat-
ing a language score and an acoustics score for the par-
tial word string.
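
By way of illustration only, determining word boundaries by dynamic programming without fixing them in advance might be sketched as follows. As a simplifying assumption, each word is given a hypothetical per-frame score in place of a word model built from the acoustic-model and dictionary data bases, and language scores are omitted because they do not depend on where the boundaries fall in this toy setting.

```python
def segment(frame_scores):
    """Find the word boundaries maximizing the accumulated score.

    frame_scores[k][t]: hypothetical score of frame t under word k.
    Returns (best total score, first frame index of each word after the first).
    """
    n_words, n_frames = len(frame_scores), len(frame_scores[0])
    neg_inf = float("-inf")
    # best[k][t]: best score when the first k words cover frames 0..t-1.
    best = [[neg_inf] * (n_frames + 1) for _ in range(n_words + 1)]
    back = [[0] * (n_frames + 1) for _ in range(n_words + 1)]
    best[0][0] = 0.0
    for k in range(1, n_words + 1):
        for t in range(k, n_frames + 1):
            for s in range(k - 1, t):  # word k covers frames s..t-1
                if best[k - 1][s] == neg_inf:
                    continue
                cand = best[k - 1][s] + sum(frame_scores[k - 1][s:t])
                if cand > best[k][t]:
                    best[k][t], back[k][t] = cand, s
    # Trace the boundaries back from the end of the feature-amount series.
    starts, t = [], n_frames
    for k in range(n_words, 0, -1):
        t = back[k][t]
        starts.append(t)
    starts.reverse()
    return best[n_words][n_frames], starts[1:]


# Two words ("ii", "tenki") over five frames, toy scores.
scores = [
    [-1.0, -1.0, -5.0, -6.0, -7.0],  # "ii" fits the first two frames
    [-6.0, -5.0, -1.0, -1.0, -1.0],  # "tenki" fits the last three frames
]
total, boundaries = segment(scores)
```

With these toy scores the boundary falls at frame 2, which would correspond to correcting the time information of the node between the two words.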

[0131] When the re-evaluation section 15 obtains the
language score, the acoustics score, and the word
boundaries of each word of the partial word string, the
re-evaluation section 15 uses the new language scores
and acoustics scores to correct the language scores and
the acoustics scores which the arcs constituting the par-
tial pass stored in the word-connection-information stor-
age section 16 corresponding to the partial word string
have, and also uses the new word boundaries to correct
the time information which the nodes constituting the
partial pass stored in the word-connection-information
storage section 16 corresponding to the partial word
string have. In the present embodiment, the re-evalua-
tion section 15 corrects the word-connection information
through the control section 11.

[0132] When the node Node5 shown in Fig. 7 is set to
an aimed-at node, for example, if a word string "ii" and
"tenki" formed of the node Node3, the arc Arc3
corresponding to the word "ii," the node Node4, the arc Arc4
corresponding to the word "tenki," and the node Node5 is
examined within the partial pass extending from the initial
node Node1 to the aimed-at node Node5, the re-evaluation
section 15 generates word models for the words
"ii" and "tenki," and calculates acoustics scores by
referring to the acoustic-model data base 17C and the
dictionary data base 18C with the use of the feature-amount
series from the time corresponding to the node
Node3 to the time corresponding to the node Node5. The
re-evaluation section 15 also calculates language
scores for the words "ii" and "tenki" by referring to the
grammar data base 19C. More specifically, when the
grammar data base 19C stores a grammar rule based
on trigram, for example, the re-evaluation section 15 uses,
for the word "ii," the word "wa" disposed immediately
therebefore and the word "kyou" disposed one more
word before to calculate the probability of a word chain
"kyou," "wa," and "ii" in that order, and calculates a
language score according to the obtained probability. The
re-evaluation section 15 uses, for the word "tenki," the
word "ii" disposed immediately therebefore and the
word "wa" disposed one more word before to calculate
the probability of a word chain "wa," "ii," and "tenki" in
that order, and calculates a language score according
to the obtained probability.
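
The trigram-based calculation of language scores described above
may be illustrated by the following minimal Python sketch. The
probability table, its values, the floor value for unseen word
chains, the use of log probabilities, and the function name
trigram_language_score are all hypothetical; the specification
does not state how the grammar data base 19C stores or combines
probabilities.

```python
import math

# Hypothetical trigram probabilities P(w3 | w1, w2); the values are made up.
TRIGRAM_PROB = {
    ("kyou", "wa", "ii"): 0.20,
    ("wa", "ii", "tenki"): 0.30,
}

def trigram_language_score(w1, w2, w3, floor=1e-6):
    """Return a log-probability language score for w3 given the two words before it."""
    # Word chains missing from the table fall back to a small floor (an assumption).
    p = TRIGRAM_PROB.get((w1, w2, w3), floor)
    return math.log(p)

words = ["kyou", "wa", "ii", "tenki"]
# "ii" is scored from ("kyou", "wa"); "tenki" is scored from ("wa", "ii").
scores = {
    words[i]: trigram_language_score(words[i - 2], words[i - 1], words[i])
    for i in range(2, len(words))
}
```

In this sketch a larger (less negative) score corresponds to a
more probable word chain.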

[0133] The re-evaluation section 15 accumulates
acoustics scores and language scores obtained as
described above, and determines the word boundary
between the words "ii" and "tenki" so as to obtain the
largest accumulated value. The re-evaluation section 15
uses the obtained acoustics scores and language scores
to correct the acoustics scores and the language scores
which the arc Arc3 corresponding to the word "ii" has
and the arc Arc4 corresponding to the word "tenki" has,
and uses the determined word boundary to correct the
time information which the node Node4 corresponding
to the word boundary between the words "ii" and "tenki"
has.

[0134] Therefore, the re-evaluation section 15 determines
the word boundaries of the words constituting the
partial word string by the dynamic programming method,
and sequentially corrects the word-connection information
stored in the word-connection-information storage
section 16. Since the preliminary word-selecting
section 13 and the matching section 14 perform
processing by referring to the corrected word-connection
information, the precision and reliability of the
processing are improved.
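
The determination of word boundaries by dynamic programming
may be sketched as follows, under stated assumptions. The function
best_boundaries, the callback segment_score (standing in for the
accumulated acoustics and language score of one word over a span of
frames), the toy scoring function, and the frame indices are all
hypothetical and are not taken from the specification.

```python
def best_boundaries(words, start, end, segment_score):
    """Choose boundaries for `words` over frames [start, end] with the largest total score.

    segment_score(word, s, t) is a hypothetical callback returning the score
    of `word` spanning frames s to t. Assumes end - start >= len(words).
    """
    n = len(words)
    neg = float("-inf")
    # best[i][t]: largest score when the first i words end exactly at frame t.
    best = [{t: neg for t in range(start, end + 1)} for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    best[0][start] = 0.0
    for i in range(1, n + 1):
        for t in range(start + i, end + 1):
            for s in range(start + i - 1, t):
                if best[i - 1][s] == neg:
                    continue
                cand = best[i - 1][s] + segment_score(words[i - 1], s, t)
                if cand > best[i][t]:
                    best[i][t] = cand
                    back[i][t] = s
    # Trace the chosen boundaries back from the final frame.
    bounds, t = [], end
    for i in range(n, 0, -1):
        bounds.append(t)
        t = back[i][t]
    bounds.append(t)
    return best[n][end], bounds[::-1]

def toy_score(word, s, t):
    # Made-up score: each word prefers a fixed duration in frames.
    ideal = {"ii": 3, "tenki": 5}[word]
    return -abs((t - s) - ideal)

total, bounds = best_boundaries(["ii", "tenki"], 10, 18, toy_score)
```

In this toy case the middle element of bounds plays the role of the
corrected time information of the node between "ii" and "tenki."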

[0135] In addition, since the re-evaluation section 15
corrects word boundaries included in the word-connection
information, the number of word-boundary candidates
to be stored in the word-connection information
can be greatly reduced, which makes efficient use of the
memory capacity.

[0136] In other words, conventionally, three times t_{1-1},
t_1, and t_{1+1} need to be held as word-boundary candidates
between the words "kyou" and "wa" as described
before by referring to Fig. 2. If the time t_1, which is the
correct word boundary, is erroneously not held, matching
processing thereafter is adversely affected. In contrast,
when the re-evaluation section 15 sequentially
corrects word boundaries, even if only the time t_{1-1},
which is an erroneous word boundary, is held, for example,
the re-evaluation section 15 changes the time t_{1-1},
which is an erroneous word boundary, to the time t_1,
which is the correct word boundary. Therefore, matching
processing thereafter is not adversely affected.

[0137] The re-evaluation section 15 uses cross-word
models in which words disposed before and after a target
word are taken into account, for words constituting
the partial word string except the top and end words, to
calculate acoustics scores. Words disposed before and
after a target word can be taken into account also in the
calculation of language scores. Therefore, highly precise
processing is made possible. Furthermore, since
the re-evaluation section 15 sequentially performs
processing, a large delay which occurs in two-pass
decoding, described before, does not happen.

[0138] When the re-evaluation section 15 has corrected
the word-connection information stored in the
word-connection-information storage section 16 as described
above, the re-evaluation section 15 reports the completion
of correction to the matching section 14 through the
control section 11.

[0139] As described above, after the matching section
14 receives the matching processing instruction from
the control section 11, when the matching section 14 is
notified by the re-evaluation section 15 through the
control section 11 that the word-connection information
has been corrected, the matching section 14 sends the
aimed-at node and the time information which the
aimed-at node has to the preliminary word-selecting
section 13 and asks it to apply preliminary word-selecting
processing, and the processing proceeds to step S5.

[0140] In step S5, when the preliminary word-selecting
section 13 receives the request for preliminary
word-selecting processing from the matching section
14, the preliminary word-selecting section 13 applies
preliminary word-selecting processing for selecting a
word candidate serving as an arc to be connected to the
aimed-at node, to the words stored in the word dictionary
of the dictionary data base 18A.

[0141] More specifically, the preliminary word-selecting
section 13 recognizes the starting time of a series of
feature amounts used for calculating a language score
and an acoustics score, from the time information which
the aimed-at node has, and reads the required series of
feature amounts, starting from the starting time, from the
feature-amount storage section 12. The preliminary
word-selecting section 13 also generates a word model
for each word stored in the word dictionary of the
dictionary data base 18A by connecting acoustic models
stored in the acoustic-model data base 17A, and
calculates an acoustics score according to the word model
by the use of the series of feature amounts read from
the feature-amount storage section 12.

[0142] The preliminary word-selecting section 13
calculates the language score of the word corresponding
to each word model according to the grammar rule
stored in the grammar data base 19A. Specifically, the
preliminary word-selecting section 13 obtains the
language score of each word according to, for example,
unigram.

[0143] It is possible that the preliminary word-selecting
section 13 uses cross-word models depending on
words (words corresponding to arcs having the aimed-at
node as ending ends) disposed immediately before
target words to calculate the acoustics score of each
word by referring to the word-connection information.

[0144] It is also possible that the preliminary
word-selecting section 13 calculates the language score of each
word according to bigram which specifies the probability
of chaining the target word and a word disposed
therebefore by referring to the word-connection information.

[0145] When the preliminary word-selecting section 
13 obtains the acoustics score and language score of 
each word, as described above, the preliminary word- 
selecting section 13 obtains a score (hereinafter called 
a word score, if necessary) which is a total evaluation 
of the acoustics score and the language score, and 
sends L words having higher word scores to the match- 
ing section 14 as words to which matching processing 
is to be applied. 
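
As an illustration only, the selection of the L words having higher
word scores may be sketched in Python as follows. Treating the
"total evaluation" as a plain sum of the acoustics score and the
language score is an assumption, and the function name
select_top_words and the score values are hypothetical.

```python
import heapq

def select_top_words(candidates, L):
    """Return the L words with the highest word scores.

    candidates maps each word to (acoustics_score, language_score).
    """
    # Assumption: the word score is the sum of the two scores.
    word_score = {w: a + l for w, (a, l) in candidates.items()}
    return heapq.nlargest(L, word_score, key=word_score.get)

# Made-up scores for four dictionary words.
candidates = {
    "ii": (-12.0, -1.5),
    "tenki": (-20.0, -2.0),
    "desu": (-15.0, -1.0),
    "ne": (-30.0, -0.5),
}
selected = select_top_words(candidates, 2)
```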

[0146] The preliminary word-selecting section 13
selects a word according to the word score which is a total
evaluation of the acoustics score and the language
score of each word. It is also possible that the preliminary
word-selecting section 13 selects words according
to, for example, only acoustics scores or only language
scores.

[0147] It is also possible that the preliminary
word-selecting section 13 uses only the beginning portion of the
series of feature amounts read from the feature-amount
storage section 12 to obtain several phonemes for the
beginning portion of the corresponding word according
to the acoustic models stored in the acoustic-model data
base 17A, and selects words in which the beginning
portions thereof match the obtained phonemes.
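
A minimal sketch of selecting words whose beginning portions match
the obtained phonemes is given below. The dictionary contents, the
phoneme symbols, and the function name select_by_prefix are
hypothetical, and the phoneme recognition itself is assumed to have
been performed already.

```python
def select_by_prefix(dictionary, heard_prefix):
    """Select words whose leading phonemes equal the recognized prefix.

    dictionary maps each word to its list of phonemes.
    """
    n = len(heard_prefix)
    return [w for w, phonemes in dictionary.items() if phonemes[:n] == heard_prefix]

# Made-up pronunciation dictionary.
dictionary = {
    "tenki": ["t", "e", "N", "k", "i"],
    "tenpo": ["t", "e", "N", "p", "o"],
    "kyou": ["k", "y", "o", "u"],
}
matches = select_by_prefix(dictionary, ["t", "e", "N"])
```
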
[0148] It is further possible that the preliminary
word-selecting section 13 recognizes the part of speech of the
word (word corresponding to the arc having the aimed- 
at node as an ending-end node) disposed immediately 
before the target word by referring to the word-connec- 
tion information, and selects words serving as a part of 
speech which is likely to follow the recognized part of 
speech. 

[0149] The preliminary word-selecting section 13 may
use any word-selecting method. In the extreme case,
words may be selected at random.

[0150] When the matching section 14 receives the L
words (hereinafter called selected words) used in
matching processing from the preliminary word-selecting
section 13, the matching section 14 applies matching
processing to the selected words in step S6.

[0151] Specifically, the matching section 14 recognizes
the starting time of a series of feature amounts used
for calculating a language score and an acoustics score,
from the time information which the aimed-at node has,
and reads the required series of feature amounts,
starting from the starting time, from the feature-amount
storage section 12. The matching section 14 recognizes the
phoneme information of the selected words sent from
the preliminary word-selecting section 13 by referring to
the dictionary data base 18B, reads the acoustic models
corresponding to the phoneme information from the
acoustic-model data base 17B, and connects the acoustic
models to form word models.

[0152] The matching section 14 calculates the acoustics
scores of the selected words sent from the preliminary
word-selecting section 13 by the use of the feature-amount
series read from the feature-amount storage
section 12, according to the word models formed as
described above. It is possible that the matching section
14 calculates the acoustics scores of the selected words
by referring to the word-connection information,
according to cross-word models.

[0153] The matching section 14 also calculates the
language scores of the selected words sent from the
preliminary word-selecting section 13 by referring to the
grammar data base 19B. Specifically, the matching
section 14 refers to, for example, the word-connection
information to recognize words disposed immediately
before the selected words sent from the preliminary
word-selecting section 13 and words disposed one more word
before, and obtains the language scores of the selected
words sent from the preliminary word-selecting section
13 by the use of probabilities based on bigram or
trigram.

[0154] The matching section 14 obtains the acoustics
scores and the language scores of all the L selected
words sent from the preliminary word-selecting section
13, as described above, and the processing proceeds
to step S7. In step S7, for each selected word, a word
score which is a total evaluation of the acoustics score
and the language score of the word is obtained, and the
word-connection information stored in the
word-connection-information storage section 16 is updated
according to the obtained word scores.

[0155] In other words, in step S7, the matching section
14 obtains the word scores of the selected words,
and, for example, compares the word scores with a
predetermined threshold to narrow the selected words
down to words which can serve as an arc to be connected
to the aimed-at node. Then, the matching section 14
sends the words obtained by narrowing down to the
control section 11 together with the acoustics scores
thereof, the language scores thereof, and the ending times
thereof.
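
The narrowing down by a predetermined threshold may be sketched as
follows. The record type Candidate, the function narrow_down, the
threshold value, and the scores are hypothetical, and the word score
is again assumed to be the sum of the acoustics score and the
language score.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    word: str
    acoustic: float   # acoustics score
    language: float   # language score
    end_time: int     # ending time, here a frame index

def narrow_down(candidates, threshold):
    """Keep the candidates whose word score reaches the threshold."""
    # Assumption: word score = acoustics score + language score.
    return [c for c in candidates if c.acoustic + c.language >= threshold]

selected = [
    Candidate("ii", -12.0, -1.5, 42),
    Candidate("tenki", -20.0, -2.0, 47),
    Candidate("desu", -15.0, -1.0, 45),
]
# Each kept record carries the scores and ending time to be sent onward.
kept = narrow_down(selected, threshold=-17.0)
```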

[0156] The matching section 14 recognizes the ending
time of each word from the extracting time of the
feature amount used for calculating the acoustics score.
When a plurality of extracting times which are highly likely
to serve as the ending time of a word are obtained,
sets of each ending time, the corresponding acoustics
score, and the corresponding language score of the
word are sent to the control section 11.

[0157] When the control section 11 receives the
acoustics score, language score, and ending time of
each word from the matching section 14, as described
above, the control section 11 uses the aimed-at node in the
word-connection information (Fig. 5) stored in the
word-connection-information storage section 16 as a starting
node, extends an arc, and connects the arc to the
ending-end node corresponding to the ending time, for each
word. The control section 11 also assigns to each arc
the corresponding word, the corresponding acoustics
score, and the corresponding language score, and gives
the corresponding end time as time information to the
ending-end node of each arc. Then, the processing
returns to step S2, and the same processes are repeated.

[0158] As described above, the word-connection
information is sequentially updated according to the
results of processing executed in the matching section 14,
and further, sequentially updated by the re-evaluation
section 15. Therefore, the preliminary word-selecting
section 13 and the matching section 14 can always use
the word-connection information for their processing.

[0159] The control section 11 integrates, if possible, 
two ending-end nodes into one, as described above, 
when updating the word-connection information. 
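
The updating of the word-connection information by extending arcs
and integrating ending-end nodes may be pictured with the following
sketch. The classes Arc and WordGraph and the method extend are
hypothetical. The rule used here for integrating nodes (arcs of the
same word ending at the same time share one ending-end node) is an
assumption made for illustration; the passage above only states that
nodes are integrated if possible.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    word: str
    acoustic: float
    language: float
    start: int  # index of the starting node
    end: int    # index of the ending-end node

@dataclass
class WordGraph:
    node_times: list = field(default_factory=list)  # time information of each node
    arcs: list = field(default_factory=list)

    def add_node(self, time):
        self.node_times.append(time)
        return len(self.node_times) - 1

    def extend(self, aimed_at, word, acoustic, language, end_time):
        """Extend an arc for `word` from the aimed-at node and return its ending-end node."""
        # Assumed integration rule: reuse the ending-end node of an existing
        # arc for the same word that ends at the same time.
        for arc in self.arcs:
            if arc.word == word and self.node_times[arc.end] == end_time:
                end = arc.end
                break
        else:
            end = self.add_node(end_time)
        self.arcs.append(Arc(word, acoustic, language, aimed_at, end))
        return end

g = WordGraph()
n0 = g.add_node(0)
a = g.extend(n0, "kyou", -5.0, -1.0, 10)   # "kyou" ending at time 10
b = g.extend(n0, "kyou", -6.0, -1.0, 12)   # "kyou" ending at time 12
c1 = g.extend(a, "wa", -3.0, -0.5, 15)
c2 = g.extend(b, "wa", -4.0, -0.5, 15)     # shares the ending-end node of c1
```
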
[0160] When it is determined in step S2 that there is
no intermediate node, the processing proceeds to step
S8. The control section 11 refers to the word-connection
information to accumulate word scores for each pass
formed in the word-connection information to obtain the
final score, outputs, for example, the word string
corresponding to the arcs constituting the pass which has the
highest final score as the result of voice recognition for
the user's utterance, and terminates the processing.
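
Step S8 may be illustrated by the following sketch, which
accumulates word scores along each pass and returns the word string
of the pass having the highest final score. The arc list, the
scores, the alternative word "yoi," and the function name
best_word_string are hypothetical; a node without outgoing arcs is
assumed to end a pass.

```python
from functools import lru_cache

def best_word_string(arcs, initial):
    """Return (final_score, words) of the highest-scoring pass from `initial`.

    arcs is a list of (start_node, end_node, word, word_score) forming a
    graph without cycles.
    """
    outgoing = {}
    for start, end, word, score in arcs:
        outgoing.setdefault(start, []).append((end, word, score))

    @lru_cache(maxsize=None)
    def best_from(node):
        # A node with no outgoing arc ends a pass.
        if node not in outgoing:
            return 0.0, ()
        best = None
        for end, word, score in outgoing[node]:
            tail_score, tail_words = best_from(end)
            cand = (score + tail_score, (word,) + tail_words)
            if best is None or cand[0] > best[0]:
                best = cand
        return best

    return best_from(initial)

# Made-up word-connection information with two competing passes.
arcs = [
    (1, 2, "kyou", -6.0),
    (2, 3, "wa", -3.0),
    (3, 4, "ii", -4.0),
    (4, 5, "tenki", -7.0),
    (3, 6, "yoi", -6.0),
    (6, 5, "tenki", -7.5),
    (5, 7, "desune", -5.0),
]
final_score, words = best_word_string(arcs, initial=1)
```
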
[0161] As described above, the preliminary word-selecting
section 13 selects one or more words following
words which have been obtained in a word string serving
as a candidate for a result of voice recognition; the
matching section 14 calculates scores for the selected
words, and forms a word string serving as a candidate
for a result of voice recognition according to the scores;
the re-evaluation section 15 corrects word-connection
relationships between words in the word string serving
as a candidate for a result of voice recognition; and the
control section 11 determines a word string serving as
the result of voice recognition according to the corrected
word-connection relationships. Therefore, highly precise
voice recognition is performed while an increase of
resources required for processing is suppressed.

[0162] Since the re-evaluation section 15 corrects
word boundaries in the word-connection information,
the time information which the aimed-at node has
indicates a word boundary highly precisely. The preliminary
word-selecting section 13 and the matching section 14
perform processing by the use of a series of feature
amounts from the time indicated by the highly precise
time information. Therefore, even when a determination
reference for selecting words in the preliminary
word-selecting section 13 and a determination reference for
narrowing the selected words in the matching section
14 are made strict, a possibility of excluding a correct
word which serves as a result of voice recognition is
made very low.

[0163] When the determination reference for selecting
words in the preliminary word-selecting section 13
is made strict, the number of words to which the matching
section 14 applies matching processing is reduced.
As a result, the amount of calculation and the memory
capacity required for the processing in the matching
section 14 are also reduced.

[0164] Even when the preliminary word-selecting section
13 fails to select, at a certain time, a word which starts
from that time and which is one of the words constituting
the word string serving as the correct result of voice
recognition, if the word is selected at an erroneous time
shifted from that time, the re-evaluation section 15
corrects the erroneous time, and the word string serving as
the correct result of voice recognition is obtained. In other
words, even if the preliminary word-selecting section
13 fails to select a word which is one of the words
constituting the word string serving as the correct result of
voice recognition, the re-evaluation section 15 corrects
the failure of selection to obtain the word string serving
as the correct result of voice recognition.

[0165] Therefore, the re-evaluation section 15 cor- 
rects an erroneous word selection executed by the pre- 
liminary word-selecting section 13 in addition to an er- 
roneous detection of an end time executed by the 
matching section 14. 

[0166] The series of processing described above can
be implemented by hardware or software. When the
series of processing is achieved by software, a program
constituting the software is installed into a general-purpose
computer or the like.

[0167] Fig. 8 shows an example structure of a com- 
puter in which a program for executing the series of 
processing described above is installed, according to an 
embodiment. 

[0168] The program can be recorded in advance into
a hard disk 105 or a read-only memory (ROM) 103 serving
as a recording medium which is built in the computer.

[0169] Alternatively, the program can be recorded
temporarily or permanently into a removable recording medium
111, such as a floppy disk, a compact disc read-only
memory (CD-ROM), a magneto-optical (MO) disk, a
digital versatile disk (DVD), a magnetic disk, or a
semiconductor memory. Such a removable recording medium
111 can be provided as so-called package software.

[0170] The program may be installed from the removable
recording medium 111, described above, to the
computer. Alternatively, the program is transferred by
radio from a downloading site to the computer through
an artificial satellite for digital satellite broadcasting, or
to the computer by wire through a network such as a
local area network (LAN) or the Internet; is received by
a communication section 108 of the computer; and is
installed into the hard disk 105, built in the computer.

[0171] The computer includes a central processing
unit (CPU) 102. The CPU 102 is connected to an input
and output interface 110 through a bus 101. When the
user operates an input section 107 formed of a keyboard,
a mouse, and a microphone to input a command
through the input and output interface 110, the CPU 102
executes a program stored in the ROM 103 according
to the command. Alternatively, the CPU 102 loads into
a random access memory (RAM) 104 a program stored
in the hard disk 105; a program transferred through a
satellite or a network, received by the communication
section 108, and installed into the hard disk 105; or a
program read from the removable recording medium
111 mounted to a drive 109, and installed into the hard
disk 105; and executes it. The CPU 102 executes the
processing illustrated in the above flowchart, or
processing performed by the structure shown in the
above block diagram. Then, the CPU 102 outputs the
processing result as required, for example, through the
input and output interface 110 from an output section
106 formed of a liquid crystal display (LCD) and a
speaker; transmits the processing result from the
communication section 108; or records the processing result in
the hard disk 105.

[0172] In the present specification, the steps describing
the program for making the computer execute various
types of processing are not necessarily executed in
a time-sequential manner in the order described in the
flowchart, and include processing (such as parallel
processing or object-based processing) executed in
parallel or separately.

[0173] The program may be executed by one computer
or may be processed in a distributed manner by a
plurality of computers. The program may also be
transferred to a remote computer and executed.

[0174] Since words for which the matching section 14
calculates scores have been selected in advance by the
preliminary word-selecting section 13, the matching
section 14 can calculate scores for each word independently
without forming a tree-structure network in which
a part of acoustics-score calculation is shared, as
described above. In this case, the capacity of a memory
used by the matching section 14 to calculate scores for
each word is suppressed to a low level. In addition, in
this case, since each word can be identified when a
score calculation is started for the word, a wasteful
calculation is prevented which is otherwise performed
because the word is not identified. In other words, before
an acoustics score is calculated for a word, a language
score is calculated and branch cutting is executed
according to the language score, so that a wasteful
acoustics-score calculation is prevented.
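
Branch cutting according to the language score before any acoustics
score is calculated may be sketched as follows. The function
score_words, the beam width, the toy scoring functions, and the
scores are hypothetical; the specification does not state the
pruning criterion, so a fixed margin below the best language score
is assumed here.

```python
def score_words(words, language_score, acoustic_score, beam):
    """Score each word independently, skipping acoustics scoring for pruned words.

    Assumes `words` is not empty.
    """
    lang = {w: language_score(w) for w in words}
    best = max(lang.values())
    results = {}
    for w in words:
        # Branch cutting: a word far below the best language score is
        # dropped before its (costly) acoustics score is calculated.
        if lang[w] < best - beam:
            continue
        results[w] = lang[w] + acoustic_score(w)
    return results

calls = []  # records which words actually reach acoustics scoring

def toy_acoustic(word):
    calls.append(word)
    return -10.0

lang_table = {"ii": -1.0, "tenki": -2.0, "zzz": -9.0}
result = score_words(list(lang_table), lang_table.get, toy_acoustic, beam=5.0)
```
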
[0175] The preliminary word-selecting section 13, the
matching section 14, and the re-evaluation section 15
can calculate scores for each word independently in
terms of time. In this case, the same memory required
for the score calculation can be shared to suppress the
required memory capacity to a low level.

[0176] The voice recognition apparatus shown in Fig.
4 can be applied to voice interactive systems used in a
case in which a data base is searched by voice, in a
case in which various types of units are operated by
voice, and in a case in which data is input to each unit
by voice. More specifically, for example, the voice
recognition apparatus can be applied to a data-base
searching apparatus for displaying map information in
response to an inquiry of the name of a place by voice,
an industrial robot for classifying materials in response
to an instruction by voice, a dictation system for
generating texts in response to a voice input instead of a
keyboard input, and an interactive system in a robot for
talking with a user.

[0177] According to a voice recognition apparatus,
a voice recognition method, and a recording medium
of the present invention, one or more words are selected
from a group of words to which voice recognition
is applied, to serve as words following words which have
been obtained in a word string serving as a candidate
for a result of voice recognition; scores are calculated
for the selected words; and a word string serving as a
candidate for a result of voice recognition is formed.
Connection relationships between words in the word
string serving as a candidate for a result of voice
recognition are corrected, and a word string serving as the
result of voice recognition is determined according to the
corrected connection relationships. Therefore, highly
precise voice recognition is implemented while an
increase of resources required for processing is
suppressed.

[0178] In so far as the embodiments of the invention
described above are implemented, at least in part, using
software-controlled data processing apparatus, it will be
appreciated that a computer program providing such
software control and a storage medium by which such
a computer program is stored are envisaged as aspects
of the present invention.

[0179] Features from the dependent claims may be
combined with features of the independent claims as
appropriate and not merely as explicitly set out in the
claims.



Claims

1. A voice recognition apparatus for calculating a
score indicating the likelihood of a result of voice
recognition applied to an input voice and for
recognizing the voice according to the score, comprising:

selecting means for selecting one or more
words following words which have been obtained
in a word string serving as a candidate
for a result of the voice recognition, from a
group of words to which voice recognition is
applied;

forming means for calculating the scores for the
words selected by the selecting means, and for
forming a word string serving as a candidate for
a result of the voice recognition according to the
scores;

storage means for storing word-connection
relationships between words in the word string
serving as a candidate for a result of the voice
recognition;

correction means for correcting the word-connection
relationships; and

determination means for determining a word
string serving as the result of the voice
recognition according to the corrected word-connection
relationships.

2. A voice recognition apparatus according to Claim
1, wherein the storage means stores the connection
relationships by using a graph structure expressed
by a node and an arc.

3. A voice recognition apparatus according to Claim
2, wherein the storage means stores nodes which
can be shared as one node.


4. A voice recognition apparatus according to Claim
1, wherein the storage means stores the acoustic
score and the linguistic score of each word, and the
starting time and the ending time of the utterance
corresponding to each word, together with the
connection relationships between words.

5. A voice recognition apparatus according to Claim
1, wherein the forming means forms a word string
serving as a candidate for a result of the voice
recognition by connecting the words for which the
scores are calculated to a word for which a score
has been calculated, and

the correction means sequentially corrects
the connection relationships every time a word is
connected by the forming means.

6. A voice recognition apparatus according to Claim
1, wherein one of the selecting means and the forming
means performs processing while referring to
the connection relationships.



7. A voice recognition apparatus according to Claim
1, wherein one of the selecting means, the forming
means, and the correction means calculates an
acoustic or linguistic score for a word, and performs
processing according to the acoustic or linguistic
score.

8. A voice recognition apparatus according to Claim
7, wherein one of the selecting means, the forming
means, and the correction means calculates an
acoustic or linguistic score for each word
independently.

9. A voice recognition apparatus according to Claim
7, wherein one of the selecting means, the forming
means, and the correction means calculates an
acoustic or linguistic score for each word
independently in terms of time.

10. A voice recognition apparatus according to Claim
7, wherein the correction means calculates an
acoustic or linguistic score for a word by referring
to the connection relationships with a word disposed
before or after the word for which a score is
to be calculated being taken into account.

11. A voice recognition method for calculating a score
indicating the likelihood of a result of voice recognition
applied to an input voice and for recognizing
the voice according to the score, comprising:

a selecting step of selecting one or more words
following words which have been obtained in a
word string serving as a candidate for a result
of the voice recognition, from a group of words
to which voice recognition is applied;

a forming step of calculating the scores for the
words selected in the selecting step, and of
forming a word string serving as a candidate for
a result of the voice recognition according to the
scores;

a correction step of correcting word-connection
relationships between words in the word string
serving as a candidate for a result of the voice
recognition, the word-connection relationships
being stored in storage means; and

a determination step of determining a word
string serving as the result of the voice
recognition according to the corrected word-connection
relationships.

12. A recording medium storing a program which
makes a computer execute voice-recognition
processing for calculating a score indicating the
likelihood of a result of voice recognition applied to
an input voice and for recognizing the voice according
to the score, the program comprising:

a selecting step of selecting one or more words
following words which have been obtained in a
word string serving as a candidate for a result
of the voice recognition, from a group of words
to which voice recognition is applied;

a forming step of calculating the scores for the
words selected in the selecting step, and of
forming a word string serving as a candidate for
a result of the voice recognition according to the
scores;

a correction step of correcting word-connection
relationships between words in the word string
serving as a candidate for a result of the voice
recognition, the word-connection relationships
being stored in storage means; and

a determination step of determining a word
string serving as the result of the voice
recognition according to the corrected word-connection
relationships.

FIG. 6
FIG. 7
FIG. 8